The practical will be run using RStudio Cloud to avoid troubleshooting individual machines. Therefore, you need an RStudio Cloud account. However, remember to install R (https://cran.r-project.org) and RStudio (http://www.rstudio.com/) on your own computer. That way, you can work on the project after class.
This practical aims to implement what you have learned in class about Exploratory Data Analysis in R. You should have read Chapter 18 of Crawley (2012), The R Book. This 3-hour practical is divided into two parts. DO NOT COPY AND PASTE THE CODE! Write your own R code and run it. That is the best way to learn how this program works. Also, you have an excellent teaching assistant as a resource - ask him and/or me all the questions you have.
The text in the blue box marked as Your task: states what you need to do.
# The code in the window marked like this shows where you need to write your code
#
# If I ask you to assess the result or answer a question, you should write the text here as a comment starting with two hashes (##)
Submit your knitted R-markdown file (the HTML) via BrightSpace - before the next practical! The assessment is Pass/Fail depending on how you write and annotate the code - you need to show us you know what you are doing and are NOT copying someone else’s code.
Today you will implement classification/regression algorithms to generate a model that can help you predict which “land-cover” class is the most likely to be observed given an observed spectral signature (the reflectance of the bands captured in a satellite image). You will explore various supervised classification algorithms: Classification and Regression Trees (CART), Random Forest (RF), and Boosted Regression Trees (BRT). The practical’s goal is for you to realise that the choice of algorithm WILL affect the results. Also, today’s practical builds on the work done a few weeks back, where you built an unsupervised classification of a satellite image.
As for the classification exercise, you will need to load a
series of raster images. For this, you will use the
raster R package.
The first step is to load ALL THE BANDS in the
Landsat image in the file Aarhus_2019_utm_L8.tif into
R. For this, you will use the function stack
to create a RasterStack.
Your task:
Load ALL the bands from the Aarhus_2019_utm_L8.tif, located in R-Studio cloud data [see the Files tab in the lower right window].
For this, you will need to:
Define an object named Dir.Loc with the string of the R-studio cloud location.
Use stack() to load the file and save it as an
object named landsat8 - The file is a multi-band GeoTIFF.
Using the names() function, change the names of the bands in landsat8 as follows: Band 1 = “Coastal”; Band 2 = “Blue”; Band 3 = “Green”; Band 4 = “Red”; Band 5 = “NIR”; Band 6 = “SWIR1”; Band 7 = “SWIR2”.
# Load the required library (raster)
library("raster")
# Use `stack()` to load the file and save it as an object named landsat8.
landsat8 <- stack("Data/Aarhus_2019_utm_L8.tif")
# Change the names of landsat8 using the names() function.
names(landsat8) <- c("Coastal", "blue", "green", "red", "NIR", "SWIR1", "SWIR2")
# Print the output of the landsat8 object.
landsat8
## class : RasterStack
## dimensions : 332, 493, 163676, 7 (nrow, ncol, ncell, nlayers)
## resolution : 30, 30 (x, y)
## extent : 564270, 579060, 6219660, 6229620 (xmin, xmax, ymin, ymax)
## crs : +proj=utm +zone=32 +ellps=GRS80 +units=m +no_defs
## names : Coastal, blue, green, red, NIR, SWIR1, SWIR2
## min values : 0, 0, 0, 0, 0, 0, 0
## max values : 0.3477450, 0.3674900, 0.4233425, 0.4276600, 0.5316375, 0.5013050, 0.4579925
Now that you have a RasterStack with the Coastal, blue, green, red, NIR, SWIR1 and SWIR2 bands (the names of bands 1 to 7), you will create a TRUE colour and a FALSE colour image for the area of interest.
Your task:
Plot both a true and a false colour image of the region of interest using the plotRGB() function; in both, specify a stretch (linear or histogram) to improve the contrast.
# Use the plotRGB() function with a linear stretch to make the true colour composite - these follow an RGB order
plotRGB(landsat8[[c(4, 3, 2)]], # True colour composites follow an RGB order.
r = 1, g = 2, b = 3, # Define which band is red, green, and blue.
scale = 1, # Define the maximum value a band can take (used for scaling when colouring).
main = "Landsat True Color Composite\nlin stretch", # Add a title to the figure.
stretch = "lin", # How to stretch the values to increase the contrast.
margins = TRUE
)
# Use the plotRGB() function with a linear stretch to make the false colour composite - these follow a NIR, red, green order.
plotRGB(landsat8[[c(5, 4, 3)]], # False colour composites follow a NIR, red, green order.
r = 1, g = 2, b = 3, # Define which band is red, green, and blue.
scale = 1, # Define the maximum value a band can take (used for scaling when colouring).
main = "Landsat False Color Composite\nlin stretch", # Add a title to the figure.
stretch = "lin", # How to stretch the values to increase the contrast.
margins = TRUE
)
Based on the image above, you can have a good perspective on where you can observe particular land cover classes.
In supervised classification, you have prior knowledge about some of the features of interest (in this case, land-cover types). In the context of satellite images, this prior knowledge can come from, for example, fieldwork, reference spatial data or interpretation of high-resolution imagery (such as available on Google maps). Specific sites in the study area that represent homogeneous examples of these known land-cover types are identified. These areas are commonly referred to as training sites because the spectral properties of these sites are used to train the classification algorithm.
The next section of the practical will focus on generating the training data.
As stated above, supervised classification is based on the principle that you have prior knowledge about some of the features of interest. In this case, you will use the BASEMAP classification for the year 2018 to identify the spectral signal of predefined land cover classes. BASEMAP is a 30 m multiple-source land cover database with predictions for three epochs (2011, 2016 and 2018). The file BASEMAP-2018.tif has one set of class values and names that correspond to the levels of the land use and land cover classification (described in the Aggregated Legend.csv file).
Your task:
Load the BASEMAP data into R. As with the Landsat image
used before, the BASEMAP data in the file BASEMAP-2018.tif
is located in R-Studio cloud data [see the Files tab in the lower right
window].
# Load the file and save it as an object named BASEMAP2018
BASEMAP2018 <- raster("Data/BASEMAP-2018.tif")
You can now plot the BASEMAP image representing the 2018 classification.
Your task:
Plot the Land Cover classification raster. Here, define a 34-value contrasting colour scheme using the hcl.colors() function.
# plot the `BASEMAP2018`
image(
x = BASEMAP2018, # raster to plot.
col = sample(hcl.colors(35, palette = "Zissou")) # Define the colour ramp
)
However, it is impossible to determine the land-cover class for each grid cell based on this figure. For this, you will need to define the class names for each class (the text name for each value in the raster). These are in the Aggregated Legend.csv file and are listed below:
| Value | Land Cover class |
|---|---|
| 110000 | Building |
| 121000 | Low built up |
| 121110 | Low built up; Building |
| 122000 | High built up |
| 122110 | High built up; Building |
| 123000 | City centre |
| 123110 | City centre; Building |
| 124000 | Other built up |
| 124110 | Other built up; Building |
| 125000 | Industry/business |
| 125110 | Industry/business; Building |
| 126000 | Airport/runway |
| 126110 | Airport/runway; Building |
| 130000 | Recreation area/sports ground |
| 130110 | Recreation area/sports ground; Building |
| 141000 | Road, paved |
| 142000 | Road, not paved |
| 150000 | Railway |
| 150110 | Railway; Building |
| 160000 | Resource extraction |
| 211000 | Agriculture, intensive, temporary crops |
| 212000 | Agriculture, intensive, permanent crops |
| 220000 | Agriculture, extensive |
| 230000 | Agriculture, not classified |
| 311000 | Forest |
| 312000 | Forest, wet |
| 321000 | Nature, dry |
| 321220 | Nature, dry; Agriculture, extensive |
| 322000 | Nature, wet |
| 322220 | Nature, wet; Agriculture, extensive |
| 411000 | Lake |
| 412000 | Stream |
| 420000 | Sea |
| 800000 | Unmapped |
From the table above, it is clear that there are still many classes. However, if you read the table carefully, you will see that the first two (2) digits group the data in BASEMAP2018 into roughly eight (8) classes:
| Value | Land Cover class |
|---|---|
| 11 & 12 | Build |
| 13 | Recreation |
| 14 & 15 | Infrastructure |
| 16 | Mining |
| 21 | Agriculture |
| 31 & 32 | Nature |
| 41 & 42 | Aquatic |
| 80 | Unmapped |
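The two-digit grouping can be checked directly in base R: integer division by 10,000 recovers the first two digits of a six-digit class value (a standalone illustration; `first_two` is a hypothetical helper, not part of the practical's required code).

```r
# Recover the first two digits of a six-digit BASEMAP class value
first_two <- function(value) value %/% 10000

first_two(110000) # Building -> 11 (Build group)
first_two(141000) # Road, paved -> 14 (Infrastructure group)
first_two(211000) # Agriculture, intensive -> 21 (Agriculture group)
first_two(420000) # Sea -> 42 (Aquatic group)
```

This is why the reclassification below can work on value ranges rather than on the 34 individual class codes.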
Your task:
Reclassify BASEMAP2018 into these categories (setting Unmapped to NA) using the reclassify() function.
# Create a table to reclassify
rclmat <- matrix(c(
109999, 129999, 1, # Build
129999, 139999, 2, # Recreation
140000, 159999, 3, # Infrastructure
160000, 169999, 4, # Mining
200000, 290000, 5, # Agriculture
300000, 390000, 6, # Nature
400000, 499999, 7, # Aquatic
500000, Inf, NA # Unmapped
),
ncol = 3, byrow = TRUE
)
# reclassify the values into seven groups
BASEMAP2018ReCls <- reclassify(
BASEMAP2018,
rclmat
)
# plot the reclassified raster
image(
x = BASEMAP2018ReCls, # raster to plot.
col = c("grey", "light green", "black", "red", "orange", "dark green", "blue") # One colour per class (1-7)
)
Now you will use the function ratify from the
raster package to define the BASEMAP2018ReCls
RasterLayer as a categorical image. I recommend that you
take a step back and read more about what ratify does and
how to use the function by using the command ?ratify.
Your task:
Make BASEMAP2018ReCls a categorical raster with a Raster Attribute Table using the function ratify.
Then, plot this “categorical” raster using the function image, manually adding a legend for the land cover classes.
# Re-define the BASEMAP2018ReCls RasterLayer as a categorical variable using the ratify function.
BASEMAP2018ReCls2 <- ratify(x = BASEMAP2018ReCls)
## Now define the levels for each value in the ratified RasterLayer
# Create a data.frame with the Raster levels and the Names
rat <- data.frame(
ID = c(1:7, NA), # The attributes values.
landcover = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped") # The name-string for each class
)
# Attach the attributes lookup table data.frame (rat) as the levels of the ratified raster.
levels(BASEMAP2018ReCls2) <- rat
# Define the Plotting Space
par(mar = c(3, 3, 4, 7))
# Plot the ratified raster using the image function - one colour per land cover class, in the order of the levels defined above.
image(BASEMAP2018ReCls2,
main = "Land cover classification 2018", # Give the Name
col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)
# Add a legend linking land cover class names to the colours.
legend("topright", # Define the position of the legend.
fill = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # Define the colours of each land cover class.
legend = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped"), # Define each land cover class name.
cex = 0.7, # Define the text size
xpd = NA # Allow plotting outside the figure region
)
# Plot the ratified raster using the rasterVis package
rasterVis::levelplot(
x = BASEMAP2018ReCls2, # Raster to plot
col.regions = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # One colour per land cover class
main = "Land Cover Classification" # Main Title
)
Training and/or validation data can come from a variety of sources. In this example, you will generate the training and validation sample sites from the reclassified BASEMAP2018ReCls reference RasterLayer. You will generate the sample sites following a stratified random sampling scheme to ensure each Land-Use/Land-Cover class is sampled evenly.
Your task:
Use the function sampleStratified from the
raster package to generate a stratified random sample from
the cell values of BASEMAP2018ReCls.
As a reminder, a stratified random sample allows researchers to obtain a sample population that best represents the entire population being studied. The function sampleStratified performs a proportional random sampling for each stratum (here, the land cover classes).
# Define the training sites locations
# Set the random number generator to make the results reproducible.
set.seed(99)
# Sampling
samp2018 <- sampleStratified(
x = BASEMAP2018ReCls, # Raster to be "sampled"
size = 200, # Number of samples per class
na.rm = TRUE, # NA values are removed from random sample?
sp = TRUE # Output as a SpatialPointsDataFrame?
)
# What kind of object is samp2018?
class(samp2018)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
# Explore the data in samp2018 using the head() function.
head(samp2018)
## cell layer
## 1 27046 1
## 2 68825 1
## 3 40305 1
## 4 110799 1
## 5 154563 1
## 6 155500 1
The stratified random sampling output is a SpatialPointsDataFrame object, which is R-speak for a point shapefile (when creating samp2018, setting the sp argument to FALSE will return a non-spatial object instead).
When printing the samp2018 object, you can see there are two variables. First, the cell column contains the sampled cell numbers from BASEMAP2018ReCls. Second, the layer column contains the class values (1-7) corresponding to each Land Cover Class.
Your task:
Before you go any further, plot the BASEMAP2018ReCls2
raster and add the samp2018 object over it.
# Plot the Land cover raster - one colour per land cover class.
image(BASEMAP2018ReCls2,
main = "Land cover classification 2018", # Give the Name
col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)
# Add the sampled points over the land cover raster using the points() function.
points(samp2018, # Define the SpatialPointsDataFrame to plot
pch = 10, # Define the type of point.
col = "red"
)
# Add a legend showing the points as sampled locations
legend("bottomright", # Define the position of the legend.
pch = 10, # Define the point shape for sampled locations.
col = "red",
legend = "Sampled locations", # Define how you are naming these sampled locations.
cex = 0.7,
xpd = NA # Allow plotting outside the figure region
)
# OR
# Plot the ratified raster using the rasterVis package and the points using latticeExtra
rasterVis::levelplot(
x = BASEMAP2018ReCls2, # Raster to plot
col.regions = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # One colour per land cover class
main = "Land Cover Classification"
) +
latticeExtra::layer(sp.points(samp2018, pch = 10, col = "red"))
As you can see from the figure above, you have sampled many locations across the BASEMAP2018ReCls land cover raster. But how can you be sure that each class has 200 random observations? For this, you can tabulate the results from your stratified random sampling using the function table (look into ?table to see how the function works).
Your task:
Tabulate the results from your stratified random sampling using the
function table (look into ?table to know how
the function works)
# Extract the land cover classes identities for each plot sampled in the stratified random sample.
SampLCC <- samp2018$layer
# Use the table() function to tabulate the number of land cover classes sampled by the `sampleStratified` function
table(SampLCC)
## SampLCC
## 1 2 3 5 6 7
## 200 200 200 200 200 200
The table above clearly shows that each sampled land cover class has been sampled precisely 200 times. Note that class 4 (Mining) does not appear in the sample.
Here you will go back to the Landsat data used at the beginning of the tutorial (landsat8). As you now have a series of sites with an associated Land cover class, you can extract the cell values from the landsat8 RasterStack. These band values will be the predictor variables, and the “class values” from BASEMAP2018ReCls will be the response variable. To extract these, you will use the function extract.
Your task:
Use the function extract to extract, for each randomly selected site per Land cover class, the spectral bands in landsat8.
# Extract the values from `landsat8`
sampvals <- extract(
x = landsat8, # Define the raster to be sampled.
y = samp2018, # Define the spatial vector data with the locations to sample.
df = TRUE # Return the results as a data.frame?
)
# sampvals no longer has the spatial information! To keep the spatial information, use the `sp = TRUE` argument in the `extract` function.
# Now some housekeeping!
# drop the ID column
sampvals <- sampvals[, -1]
# combine the class information with the extracted values
sampdata <- data.frame(
classvalue = samp2018@data$layer, # Get the Land cover "class" for each selected location.
sampvals # The data.frame with the spectral bands in `landsat8`.
)
# Explore the resulting data.frame using the head() function.
head(sampdata)
## classvalue Coastal blue green red NIR SWIR1 SWIR2
## 1 1 0.05020875 0.05723500 0.07900125 0.08249375 0.2067525 0.1701500 0.12917501
## 2 1 0.01758000 0.02712250 0.05896750 0.05049750 0.2681875 0.1557675 0.09548750
## 3 1 0.03716000 0.04480500 0.07153500 0.06622750 0.3369650 0.1763100 0.09565250
## 4 1 0.03150875 0.04237125 0.05815625 0.06081000 0.0935900 0.1300963 0.11790000
## 5 1 0.02963875 0.03431375 0.05477375 0.04809125 0.2436300 0.1471050 0.09195375
## 6 1 0.02778250 0.03355750 0.06048000 0.05294500 0.2748700 0.1597000 0.10382000
Now you will train the RF classification algorithm using the sampdata dataset. Here, land cover classes are the response variable (set up as a factor, as these are coded as numbers in the raster) and the ‘Coastal’, ‘blue’, ‘green’, ‘red’, ‘NIR’, ‘SWIR1’ and ‘SWIR2’ bands are the predictors.
The first question to ask is: why not use a single classification model? Well, have you ever wondered why we ask multiple people for their opinions or reviews before going to a film, buying a car or planning a holiday? It is because the review of one person may be biased by his/her preferences. When you ask multiple people, you try to remove the bias a single person may introduce. By this, I mean that one person may have an extreme dislike for a specific destination because of her experience at that location; however, ten other people may have a very strong preference for the same destination because they have had a wonderful experience there.
In the world of analytics and data science, this is called ‘ensembling’. Ensembling is a type of supervised learning technique in which multiple models are trained on a training dataset and their outputs are combined by some rule to derive the final consensus output. This is the benefit of random forests (RF) when compared to single regression trees.
Random Forest is a very powerful ensemble classification algorithm that works by creating multiple decision trees and then combining the output generated by each tree. Decision trees are the classification models you applied in the section above, and they work on the concept of information gain at every node. At each node, a decision tree tries to split the data points and checks the information gain of each candidate split, splitting where the information gain is maximum. It follows this process until all the nodes are exhausted or there is no further information gain. Decision trees are easy-to-understand models, but they have low predictive power; they are called weak learners. Random Forest builds on the same weak learners but combines them when it finally comes up with its output.
Another point that sets Random Forest apart is that it does not use all the data points and variables in each of the trees. It randomly samples data points and variables for each of the trees it creates and then combines the outputs at the end - this approach is called a bagging method. By doing this, Random Forest removes the bias that a single decision tree model might introduce in the system. Also, it improves the predictive power significantly.
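The voting step can be sketched in base R with toy data (hypothetical predictions from five "trees" for four sites; this illustrates majority voting only, not the randomForest internals):

```r
# Each row is one hypothetical tree's predictions; each column is a site.
tree_preds <- matrix(c(
  1, 2, 3, 7,
  1, 2, 6, 7,
  5, 2, 3, 7,
  1, 3, 3, 7,
  1, 2, 3, 6
), nrow = 5, byrow = TRUE)

# The ensemble prediction for a site is the class most trees voted for.
majority_vote <- function(x) as.integer(names(which.max(table(x))))
apply(tree_preds, 2, majority_vote) # -> 1 2 3 7
```

Note how the occasional wrong vote of a single tree (e.g. the 5 in the first column) is outvoted by the rest of the forest.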
Your task:
Using the function randomForest() from the randomForest package, build a random forest model that predicts land cover classes based on the spectral signal of the sampled sites.
# Load the required package (randomForest)
library(randomForest)
# Build a random forest model that predicts land cover classes based on the hyperspectral signal of sampled sites.
# Your response variable should be a factor.
RF.model <- randomForest(
formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
data = sampdata, # Define the data frame containing the variables in the model.
ntree = 500, # Define the number of trees to grow. By default, the function builds 500 trees.
mtry = 2, # Define the Number of variables randomly sampled as candidates at each split. By default, the function uses two (2) predictors per tree.
importance = TRUE # Should the importance of predictors be assessed?
)
# Print the randomForest just created
RF.model
##
## Call:
## randomForest(formula = as.factor(classvalue) ~ ., data = sampdata, ntree = 500, mtry = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 43.58%
## Confusion matrix:
## 1 2 3 5 6 7 class.error
## 1 55 33 76 24 10 2 0.725
## 2 29 89 29 27 26 0 0.555
## 3 70 20 77 20 12 1 0.615
## 5 11 16 13 138 22 0 0.310
## 6 7 25 8 29 130 1 0.350
## 7 5 3 2 0 2 188 0.060
Based on the output above, you can see that the error rate is ~43%, which is relatively high. To improve this, you might need to build more trees and/or use more predictors - this means tuning the random forest model.
You can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry). According to the randomForest package documentation:
ntree: Number of trees to grow. This should not be set too small, to ensure that every input row gets predicted at least a few times.
mtry: Number of variables randomly sampled as
candidates at each split. Note that the default values are different for
classification (sqrt(p) where p is the number of variables in x) and
regression (p/3).
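You can verify these defaults for this dataset (p = 7 predictor bands) with a quick calculation:

```r
p <- 7 # number of predictor bands in sampdata
floor(sqrt(p)) # default mtry for classification -> 2 (matching mtry = 2 above)
floor(p / 3) # default mtry for regression -> 2
```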
You can explore how a change in the number of trees affects the Out-of-bag (OOB) error. You will now build a loop that evaluates how much the error changes as the number of trees used in the forest increases.
Your task:
Explore how the change in the number of trees affects the Out-of-bag (OOB) error.
For this, build a loop (using a for() loop) that evaluates how much the error changes as the number of trees used in the model (defined with the ntree argument) changes between 100, 500, 1000, 2000 and 5000.
# Using a Loop, evaluate how much the Out-of-bag (OOB) error changes as the number of trees shifts between 100, 500, 1000, 2000, and 5000
# Create a summary table
RF.Error.ntree <- data.frame(
ntree = c(100, 500, 1000, 2000, 5000), # A vector with the Number of trees
OOB = NA # Set the OOB to NA - will be filled for each number of trees below
)
# Loop over RF.Error.ntree$ntree to assess the impact of tree number on OOB-error.
for (i in RF.Error.ntree$ntree) {
# Set the random seed for repeatability
set.seed(99)
# Estimate the randomForest for the evaluated Number of trees
RF.mode.tmp <- randomForest(
formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
data = sampdata, # Define the data frame containing the variables in the model.
ntree = i, # Define the number of trees to grow.
mtry = 2, # Define the Number of variables randomly sampled at each split. Use the Default = 2.
importance = TRUE # Should importance of predictors be assessed?
)
# Add the estimated OOB-error to `RF.Error.ntree` - the summary data.frame
RF.Error.ntree$OOB[RF.Error.ntree$ntree == i] <- RF.mode.tmp$err.rate[nrow(RF.mode.tmp$err.rate), "OOB"]
}
# Print the summary data.frame
RF.Error.ntree
## ntree OOB
## 1 100 0.4508333
## 2 500 0.4383333
## 3 1000 0.4425000
## 4 2000 0.4408333
## 5 5000 0.4450000
The change in the Out-of-bag (OOB) error shows that models get worse as the number of trees increases beyond 500 trees. The question then is: how does the OOB error change as the number of predictors used per tree changes?
Your task:
Explore how the change in the number of predictors used per tree affects the Out-of-bag (OOB) error.
For this, build a loop (using a for() loop) evaluating how much the error changes as the number of variables randomly sampled as candidates at each split (defined with the mtry argument) changes from 1 to 7.
# Using a loop, evaluate how much the Out-of-bag (OOB) error changes as the number of variables per tree increases.
# Create a summary table
RF.Error.mtry <- data.frame(
mtry = 1:7, # A vector with the number of variables
OOB = NA # Set the OOB to NA - will be filled for each number of variables below
)
# Loop over RF.Error.mtry$mtry to assess the impact of the number of variables on OOB-error.
for (i in RF.Error.mtry$mtry) {
# Set the random seed for repeatability
set.seed(99)
# Estimate the randomForest for the evaluated Number of variables
RF.mode.tmp <- randomForest(
formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
data = sampdata, # Define the data frame containing the variables in the model.
ntree = 500, # Define the number of trees to grow. Use the Default = 500
mtry = i, # Define the Number of variables randomly sampled at each split.
importance = TRUE # Should importance of predictors be assessed?
)
# Add the estimated OOB-error to the summary data.frame
RF.Error.mtry$OOB[RF.Error.mtry$mtry == i] <- RF.mode.tmp$err.rate[nrow(RF.mode.tmp$err.rate), "OOB"]
}
# Print the summary data.frame
RF.Error.mtry
## mtry OOB
## 1 1 0.4450000
## 2 2 0.4383333
## 3 3 0.4325000
## 4 4 0.4275000
## 5 5 0.4341667
## 6 6 0.4291667
## 7 7 0.4383333
The Out-of-bag (OOB) error reaches a minimum when four (4) variables are used in building a tree. This indicates that changing the default values of the randomForest function would produce the best model for this training dataset, but we will stick with the default values as the changes in OOB error are not very large.
Although the Out-of-bag (OOB) error values might seem large, what they show is that the average predictive ability of any single tree is poor. This is also reflected in the confusion matrix produced when printing RF.model. To get some insights into confusion matrices, check out: https://bit.ly/3ltpkyk.
However, it is essential to note that the model predicts the training dataset perfectly. To see this, you need to tabulate the predicted and observed classes and assess the level of misclassification (the differences between observed and predicted values).
Your task:
Tabulate the observed vs. predicted classification and determine the level of misclassification when the model predicts the training dataset.
# predict the values for the training dataset using the predict function
predRF <- predict(
object = RF.model, # Define the randomForest with the model.
newdata = sampdata, # Define data.frame used to make the prediction - here is the training data.
type = "response" # indicating the type of output:
)
# Tabulate the observed vs. predicted classification - this is a type of confusion matrix!
table(predRF, sampdata$classvalue)
##
## predRF 1 2 3 5 6 7
## 1 200 0 0 0 0 0
## 2 0 200 0 0 0 0
## 3 0 0 200 0 0 0
## 5 0 0 0 200 0 0
## 6 0 0 0 0 200 0
## 7 0 0 0 0 0 200
The table above shows that there is zero misclassification when predicting the training dataset. However, to properly assess how good the model is, you need to evaluate it on a “testing” dataset that is different from (and preferably independent of) the “training” dataset.
The last question to ask is how important a given variable is in discriminating between land Cover Classes. As you defined the importance argument when building your Random Forest model, you can extract this information using the importance and varImpPlot functions from the randomForest package.
Your task:
Assess the importance of the used predictors to discriminate between land Cover Classes.
# Extract variable importance measures for the best attribute configuration of your randomForest model using the importance() function.
importance(RF.model)
## 1 2 3 5 6 7 MeanDecreaseAccuracy MeanDecreaseGini
## Coastal 0.7357383 28.709203 2.381668 26.62737 9.211609 4.954223 29.99518 118.4440
## blue 6.6011793 13.367544 13.746580 23.91673 21.396853 4.693822 33.06819 120.5938
## green 0.3892797 15.015633 8.848622 23.51980 33.176215 5.212055 33.26567 144.4860
## red 7.8559635 5.773528 21.561917 24.75848 20.633487 7.801037 31.97582 135.3020
## NIR 6.1042268 30.459215 16.922784 41.03353 18.477913 15.429188 31.37197 173.6557
## SWIR1 3.7179664 21.713725 13.306461 30.50167 19.520763 16.638204 25.67379 151.9236
## SWIR2 6.1146173 14.338691 19.880305 28.56497 34.121498 16.259274 30.41735 154.7797
# Plot the variable importance measures for the best attribute configuration of your randomForest model using the varImpPlot() function
varImpPlot(RF.model)
Based on the MeanDecreaseAccuracy, the most important variable is green, and the least important variable is SWIR1. In the case of MeanDecreaseGini, the relative change in node purity is the largest with NIR and the smallest with Coastal.
Now that you have a classification model, you can evaluate its prediction accuracy using a k-fold validation approach. In this technique, the data used to fit the model are split into k groups (typically five groups). In turn, one of the groups is used for model testing, while the rest of the data is used for model training (fitting). For this, you will use the kfold function of the dismo package.
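To see what such a partitioning does under the hood, here is a base-R sketch of a stratified fold assignment (a standalone illustration with toy class labels and a hypothetical `stratified_kfold` helper, not a replacement for the dismo function used below):

```r
# Assign each observation to one of k folds, stratified by class,
# so every class is split evenly across folds.
stratified_kfold <- function(class, k = 5) {
  fold <- integer(length(class))
  for (cl in unique(class)) {
    idx <- which(class == cl) # observations of this class
    # Shuffle a balanced sequence of fold labels over those observations
    fold[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  fold
}

set.seed(99)
toy_classes <- rep(c(1, 2, 3, 5, 6, 7), each = 200) # mimic sampdata$classvalue
toy_folds <- stratified_kfold(toy_classes, k = 5)
table(toy_folds, toy_classes) # every cell is 40 (200 / 5 per class)
```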
Your task:
Do a five (5) k-fold partitioning of the training dataset, stratifying the folds by the Land Cover Class
# Load the required library (dismo)
library(dismo)
# set the random seed for reproducibility
set.seed(99)
# Do a k-fold partitioning of the training dataset
k.foldDta <- kfold(sampdata, # the source data to use (the sample)
k = 5, # number of Folds (here you use 5)
by = sampdata$classvalue
) # define how to stratify the folds
# Tabulate the k-fold partitioning dataset by the Land Cover Class to see the number of samples per land cover class to be used in each fold.
table(k.foldDta, sampdata$classvalue)
##
## k.foldDta 1 2 3 5 6 7
## 1 40 40 40 40 40 40
## 2 40 40 40 40 40 40
## 3 40 40 40 40 40 40
## 4 40 40 40 40 40 40
## 5 40 40 40 40 40 40
With the k-fold partitioning, you can now train and test your Random Forest models five times - five because you have five-folds!
Your task:
Train (using those points NOT in the k-fold dataset) and test (using those points IN the k-fold dataset) your Random Forest model.
For each of the five (5) folds, store the observed and predicted classes as a component in a list named RF.Kfold.List.
Finish by estimating the confusion matrix over the five folds.
# Kfold Validation for the RF model
RF.Kfold.List <- lapply(1:5, function(k) {
# Define the Training data.frame - Observations NOT selected in the k-fold.
train <- sampdata[k.foldDta != k, ]
# Define the test data.frame - Observations selected in the k-fold.
test <- sampdata[k.foldDta == k, ]
# build a new model using the Train dataset
RF.model.kfold.Mod <- randomForest(
formula = as.factor(classvalue) ~ ., # Define a formula linking the response (land cover classes) to the predictors (bands in landsat8) - remember to Define the Land cover class as a factor
data = train, # Define the data frame containing the variables in the model.
ntree = 500, # Define the number of trees to grow. Use the Default = 500
# mtry (the number of variables randomly sampled at each split) is left at its default
importance = TRUE # Should importance of predictors be assessed?
)
# Predict the k-fold model
pclass <- predict(
object = RF.model.kfold.Mod, # Define the randomForest with the model.
newdata = test, # Define the data.frame used to make the prediction - here the test data.
type = "response" # indicating the type of output
)
# Create a matrix pairing the observed (reference) and predicted classes
cbind(
Observed = test$classvalue,
Predicted = as.integer(pclass)
)
})
# Confusion matrix for the RF model k-fold validation
# Merge the tables in the list row wise
RF.Kfold <- do.call(
"rbind",
RF.Kfold.List
)
# Make the matrix a data.frame
RF.Kfold <- data.frame(RF.Kfold)
# Set the column names for the data.frame.
colnames(RF.Kfold) <- c("observed", "predicted")
# Tabulate the (mis)matches between the observed and predicted test datasets
RF.conmat <- table(RF.Kfold)
# Change the name of the classes in the tabulation.
colnames(RF.conmat) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
rownames(RF.conmat) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
# Print the confusion matrix
RF.conmat
## predicted
## observed Build Recreation Infrastructure Agriculture Nature Aquatic
## Build 182 164 186 193 72 3
## Recreation 109 160 106 71 165 189
## Infrastructure 101 68 98 180 165 188
## Agriculture 182 164 186 193 72 3
## Nature 109 160 106 71 165 189
## Aquatic 101 68 98 180 165 188
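The task also asked for the average confusion matrix over the five folds; the code above instead pools all folds into one table. A minimal sketch of the averaging alternative, using hypothetical toy folds rather than the real k-fold output:

```r
# Average a list of per-fold confusion matrices element-wise with Reduce().
set.seed(99)
fold.tabs <- lapply(1:5, function(k) {
  obs  <- factor(sample(1:3, 60, replace = TRUE), levels = 1:3) # toy observed classes
  pred <- factor(sample(1:3, 60, replace = TRUE), levels = 1:3) # toy predicted classes
  table(observed = obs, predicted = pred)
})
avg.conmat <- Reduce("+", fold.tabs) / length(fold.tabs)
avg.conmat # each toy fold tabulates 60 points, so the average also sums to 60
```

Reduce("+", ...) adds the tables cell by cell, so dividing by the number of folds gives the mean count per cell; the same pattern would work on a list of real per-fold tables.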
Your task:
Estimate the misclassification rate (the proportion of observations wrongly classified by the model) for the Random Forest model.
### Misclassification rate of the RF model
RF.class.error <- sapply(
colnames(RF.conmat),
function(i) {
sum(RF.conmat[i, !colnames(RF.conmat) %in% i])
}
) / apply(RF.conmat, 1, sum)
# Print the per-class misclassification rate of the RF model
RF.class.error
## Build Recreation Infrastructure Agriculture Nature Aquatic
## 0.77250 0.80000 0.87750 0.75875 0.79375 0.76500
# Make a barplot of the per-class misclassification rate of the RF model
barplot(RF.class.error,
main = "RF model k-fold misclassification rate",
ylim = c(0, 1)
)
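The sapply() construction above can be written more compactly: the off-diagonal entries of a row are exactly the misclassifications, so the error rate is one minus the diagonal proportion. A sketch with a hypothetical stand-in matrix `cm` (not the practical's data):

```r
# Per-class misclassification rate = 1 - (correct on the diagonal / row total)
cm <- matrix(c(30,  5,  5,
                4, 32,  4,
                6,  3, 31),
             nrow = 3, byrow = TRUE,
             dimnames = list(observed  = c("A", "B", "C"),
                             predicted = c("A", "B", "C")))
class.error <- 1 - diag(cm) / rowSums(cm)
round(class.error, 3)
```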
Based on this approach, the misclassification rate is similar across classes for the RF model.
You can also assess the accuracy using a held-out subset of the data - this is the preferred option and the core of cross-validation. When testing the model this way, you get an idea of how accurate the classified map might be. Two widely used measures in remote sensing are "overall accuracy" and "kappa".
Overall accuracy is the probability that the model correctly classifies a randomly chosen case; that is, the sum of the true positives plus true negatives divided by the total number of cases tested. In other words, how often does the model predict the correct classification?
Your task:
Estimate the overall accuracy (proportion of cases correctly classified) for the Random Forest model.
# Overall accuracy estimation for the RF Model
## number of cases
RF.n <- sum(RF.conmat)
RF.n
## [1] 4800
# number of correctly classified cases per class in the RF model
RF.diag <- diag(RF.conmat)
# Overall Accuracy of the RF model
RF.OA <- sum(RF.diag) / RF.n
RF.OA
## [1] 0.2054167
By comparison, Cohen’s “kappa” coefficient measures the agreement between two raters (here, the observed and predicted classifications) that each classify \(N\) items into \(C\) mutually exclusive categories.
A simple way to think of this is that Cohen’s kappa is a quantitative measure of reliability for two raters rating the same thing, corrected for how often the raters may agree by chance. It is generally considered a more robust measure than a simple per cent agreement calculation, as it accounts for the possibility of agreement occurring by chance. For extra information on the index, see https://bit.ly/3Ahju9q.
Your task:
Estimate the Cohen’s “kappa” coefficient (overall accuracy corrected by the expected accuracy) for the Random Forest model.
# observed (true) cases per class
## Estimation for the RF model
# Determine the total number of observations OBSERVED for a given land cover class (the row sum)
RF.rowsums <- apply(RF.conmat, 1, sum)
# Define the probability of each OBSERVED land cover class (total-OBSERVED / total number of cases)
RF.p <- RF.rowsums / RF.n
# Determine the total number of observations PREDICTED for a given land cover class (the column sum)
RF.colsums <- apply(RF.conmat, 2, sum)
# Define the probability of each PREDICTED land cover class (total-PREDICTED / total number of cases)
RF.q <- RF.colsums / RF.n
# Estimate the Expected Accuracy - the sum, over classes, of the product of the OBSERVED and PREDICTED class probabilities
RF.expAccuracy <- sum(RF.p * RF.q)
# Estimate the kappa - (overall accuracy MINUS expected accuracy) divided by (one MINUS the expected accuracy)
RF.kappa <- (RF.OA - RF.expAccuracy) / (1 - RF.expAccuracy)
# Print
RF.kappa
## [1] 0.0465
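To see the chance correction at work, here is a toy worked example on a hypothetical 2 x 2 confusion matrix (not the practical's data):

```r
# Cohen's kappa on a toy confusion matrix: overall accuracy corrected for chance.
cm <- matrix(c(45,  5,
               15, 35), nrow = 2, byrow = TRUE)
n      <- sum(cm)                 # 100 cases
OA     <- sum(diag(cm)) / n       # observed (overall) accuracy = 0.8
p      <- rowSums(cm) / n         # observed class proportions (0.5, 0.5)
q      <- colSums(cm) / n         # predicted class proportions (0.6, 0.4)
expAcc <- sum(p * q)              # chance agreement = 0.5*0.6 + 0.5*0.4 = 0.5
kappa  <- (OA - expAcc) / (1 - expAcc)
kappa                             # (0.8 - 0.5) / 0.5 = 0.6
```

Even though the raters agree on 80% of cases, half of that agreement is expected by chance alone, so kappa credits the model with only 0.6.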
The Producer’s Accuracy is the map accuracy from the point of view of the mapmaker (the producer): how often real features on the ground are correctly shown on the classified map, or the probability that a particular land cover of an area on the ground is classified as such. The Producer’s Accuracy is the number of reference sites classified correctly divided by the total number of reference sites for that class (the row total in the confusion matrix above, where observed classes are in rows).
The User’s Accuracy is the accuracy from the point of view of a map user: it tells you how often the class shown on the map will actually be present on the ground. This is referred to as reliability. The User’s Accuracy is calculated by taking the number of correct classifications for a particular class and dividing it by the total number of cells predicted as that class (the column total).
Your task:
Estimate the producer accuracy (the number of reference sites classified correctly divided by the total number of reference sites for that class) for the Random Forest model.
# Producer Accuracy - The RF case [correct classifications per class / number of reference sites for that class (the row total)]
RF.PA <- RF.diag / RF.rowsums
# The overall PA is the mean of the per-class proportions of correct classifications
mean(RF.PA)
## [1] 0.2054167
Your task:
Estimate the user accuracy (the number of correct classifications for a particular class divided by the total number of cases predicted as that class) for the Random Forest model.
# User accuracy - The RF case [correct classifications per class / column total for that class]
RF.UA <- RF.diag / RF.colsums
# The overall UA is the mean of the per-class proportions of correct classifications
mean(RF.UA)
## [1] 0.2053
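A toy example may make the two denominators concrete. The 2 x 2 matrix below is hypothetical (observed classes in rows, predicted in columns), not the practical's data:

```r
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE,
             dimnames = list(observed  = c("X", "Y"),
                             predicted = c("X", "Y")))
# Producer's accuracy: of the reference sites of each class, the fraction mapped correctly
PA <- diag(cm) / rowSums(cm)   # 40/50 = 0.80 and 45/50 = 0.90
# User's accuracy: of the cells mapped as each class, the fraction truly of that class
UA <- diag(cm) / colSums(cm)   # 40/45 ~ 0.89 and 45/55 ~ 0.82
PA; UA
```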
Now that you have trained your classification model, you can use it
to make a prediction - even if these might not be the best models! For
this, you will use the RF model to classify all cells in
landsat8. It is important that the layer names in the
RasterStack object to be classified exactly match those
used as predictors in the model. This is the case here because the same
Raster object was used (via extract) to obtain the values used to fit the
model.
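A quick way to verify this name matching before predicting is to compare the two sets of names. The vectors below are hypothetical stand-ins for names(landsat8) and the predictor names stored in the fitted model:

```r
# Hypothetical stand-ins for the raster layer names and the model's predictors
raster.names <- c("blue", "green", "red", "NIR", "SWIR1", "SWIR2")
model.preds  <- c("blue", "green", "red", "NIR", "SWIR1", "SWIR2")
# Any predictor missing from the raster would be listed here; character(0) means all match
setdiff(model.preds, raster.names)
# If the names differed only in spelling or order, you could rename the layers,
# e.g. names(landsat8) <- model.preds
```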
Your task:
Use your Random Forest model to generate predicted land cover maps
for the region of interest (landsat8).
# Use the predict function (as implemented for raster objects) and the Random Forest model to generate predicted land cover classes for 2011.
RF.pr2018 <- predict(
object = landsat8, # Define the raster used to make the predictions.
model = RF.model, # Define the model used to make the predictions..
type = "response" # Define the type of predicted value returned
)
# Plot the RF model PREDICTED map using the image() function and the HEX colours in classdf - remember that there is no class 6 in classdf, and so it does not have a colour.
image(RF.pr2018,
main = "RF Predicted Land cover classification 2011", # Give the Name
col = c("grey", "light green", "black", "red", "orange", "dark green", "light blue")
)
# Add a legend linking land cover class names to the colours.
legend("topright", # Define the position of the legend.
# inset = c(-0.24,0), # Define inset distance(s) from the margins.
fill = c("grey", "light green", "black", "red", "orange", "dark green", "light blue", "white"), # Define the colours of each land cover class.
legend = c("Build", "Recreation", "Infrastructure", "Mining", "Agriculture", "Nature", "Aquatic", "Unmapped"), # Define each land cover class name.
cex = 0.7, # Define the text size
xpd = NA
) # Plot outside the Figure region
Focusing on the maps, plotting and counting the misclassified cells (observed != predicted) gives you an overview of the classification error.
Your task:
Show the misclassified cells (colour them RED) and those classified correctly (colour them BLUE) by the Random Forest model.
Also, estimate the proportion of all the cells in
landsat8 that are misclassified cells by the Random Forest
model.
# Use the image function to show the cells that are misclassified (Predicted != Observed) and classified correctly (Predicted == Observed) based on the Random forest model
image(RF.pr2018 == BASEMAP2018ReCls,
main = "RF misclassified cells", # FALSE (a mismatch) plots first, so red = misclassified
col = c("red", "blue")
)
# Estimate the proportion of misclassified cells (Predicted != Observed)/Ncells for the RF model.
sum((RF.pr2018 != BASEMAP2018ReCls)[], na.rm = T) / ncell(BASEMAP2018ReCls)
## [1] 0.4135609
Visual inspections and counts of misclassified cells give you a sense of the overall quality of the classification, but you also want to assess the per-class error rate (the proportion of observations in a class that the algorithm misclassifies).
Your task:
Estimate the confusion matrix for the rasters predicted using the Random forest model.
## Use the table() function to estimate the confusion matrix for the RF model
RF.ConfMtx <- table(
BASEMAP2018ReCls[],
RF.pr2018[]
)
rownames(RF.ConfMtx) <- colnames(RF.ConfMtx) <- c("Build", "Recreation", "Infrastructure", "Agriculture", "Nature", "Aquatic")
RF.ConfMtx
##
## Build Recreation Infrastructure Agriculture Nature Aquatic
## Build 15377 6051 15295 4458 2219 614
## Recreation 1440 3575 1108 1361 986 23
## Infrastructure 7301 2630 11421 2501 1487 236
## Agriculture 2334 2588 2210 18946 4139 15
## Nature 788 2073 617 3226 12207 148
## Aquatic 330 148 408 209 747 29034
# Define the per-class error rate [proportion of the cells for a given land cover class that are INCORRECTLY classified]
RFclass.error <- sapply(
colnames(RF.ConfMtx),
function(i) {
sum(RF.ConfMtx[i, !colnames(RF.ConfMtx) %in% i])
}
) / apply(RF.ConfMtx, 1, sum)
RFclass.error
## Build Recreation Infrastructure Agriculture Nature Aquatic
## 0.65063389 0.57906511 0.55344855 0.37331305 0.35951519 0.05965799
Based on this, it is clear that your model is very good at discriminating areas classified as Aquatic and does a poor job of classifying areas classified as Build.
Supervised classification is the technique most often used for the quantitative analysis of remote sensing image data. At its core is the concept of segmenting the spectral domain into regions that can be associated with the ground cover classes of interest to a particular application.
For example, you could use the PCA axes instead of the raw data in the Landsat image. Alternatively, you could use more points to develop your model (right now, you use 200 points, which is not even 1% of the available data - more data gives you a better model but requires more computing power!!). I leave these tasks to the interested student! This exercise gave you a guide to the procedure, and although your final model is not the best, it is a starting point for developing this type of analysis.
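As a pointer for the interested student, here is a minimal sketch of the PCA idea on simulated band values. The matrix `bands`, its dimensions, and the choice of three retained axes are all assumptions, not taken from the practical:

```r
# Run a PCA on (simulated) band values and keep the first three axes as
# candidate predictors in place of the raw bands.
set.seed(1)
bands <- matrix(rnorm(200 * 6), ncol = 6,
                dimnames = list(NULL, paste0("band", 1:6)))
pca <- prcomp(bands, center = TRUE, scale. = TRUE)
# Cumulative variance captured by the first three components
summary(pca)$importance["Cumulative Proportion", 1:3]
# Component scores to use as new predictors when fitting the model
scores <- predict(pca, newdata = bands)[, 1:3]
# To map the classes, the same rotation can be applied to every raster cell,
# e.g. raster::predict(landsat8, pca, index = 1:3)
```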